---
title: Crime Against Children in India
author: dave
date: '2018-02-22'
featured: "india_children.jpg"
featuredalt: "children of India"
featuredpath: "img/main"
categories:
  - EDA
  - R
tags:
  - crime
  - EDA
  - India
  - R
slug: crime-against-children-in-india
---
I found this dataset by chance on data.world, and it immediately sparked my interest because I have two small children and moved to India in 2017. The data is organized by state and specific crime from 2001 to 2012. It is a bit dated and not as granular as I would like (city-level data would have been nice), but it is still worth exploring while practicing some basic skills.
It should be noted that there generally isn’t any information about how this data was collected. There are certain crimes that appear more prevalent across all states and some for which there is no account. Perhaps people are less likely to report some crimes and more likely to report others. For the purpose of this analysis, I will take the data at face value and make assumptions along the way.
The dataset can be found here.
library(data.world)
library(tidyverse)
library(stringr)
library(maptools)
library(RColorBrewer)
library(gridExtra)
library(ggthemes)
library(plotly)
library(rcartocolor)
As per data.world’s automatically generated notebook, the first step is querying the database and checking what tables are included.
# Datasets are referenced by their URL or path
dataset_key <- "https://data.world/bhavnachawla/crime-rate-against-children-india-2001-2012"
# List tables available for SQL queries
tables_qry <- data.world::qry_sql("SELECT * FROM Tables")
tables_df <- data.world::query(tables_qry, dataset = dataset_key)
# See what is in it
tables_df$tableName
## [1] "crime_head_wise_persons_arrested_under_crime_against_children_during_2001_2012"
Next, we query the table found.
if (length(tables_df$tableName) > 0) {
sample_qry <- data.world::qry_sql(sprintf("SELECT * FROM `%s`", tables_df$tableName[[1]]))
sample_df <- data.world::query(sample_qry, dataset = dataset_key)
sample_df
}
## # A tibble: 494 x 14
## state_ut crime_head `2001` `2002` `2003` `2004` `2005` `2006` `2007`
## <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 ANDHRA PR… INFANTICIDE 1 1 3 0 0 0 1
## 2 ARUNACHAL… INFANTICIDE 0 0 0 0 0 0 0
## 3 JHARKHAND INFANTICIDE 0 0 0 0 0 0 0
## 4 TRIPURA RAPE OF CH… 0 0 0 28 6 28 14
## 5 UTTAR PRA… RAPE OF CH… 820 550 429 602 531 480 694
## 6 UTTARAKHA… RAPE OF CH… 10 8 11 35 25 39 22
## 7 WEST BENG… RAPE OF CH… 11 16 17 17 6 33 43
## 8 TOTAL (ST… RAPE OF CH… 2546 2642 3213 4001 4359 4996 5312
## 9 A & N ISL… RAPE OF CH… 0 0 2 0 6 6 3
## 10 CHANDIGARH RAPE OF CH… 14 5 16 0 23 7 11
## # ... with 484 more rows, and 5 more variables: `2008` <int>,
## # `2009` <int>, `2010` <int>, `2011` <int>, `2012` <int>
Now that we have data to work with, it makes sense to check for missing data and misspellings, and to reshape the data so it is easier to work with.
First, I’ll check for NA’s.
# check for NA's
any(is.na(sample_df))
## [1] FALSE
Since there are no NA’s, I’ll move on to checking for duplicates and typos (or duplicates caused by typos) in the state and crime columns. Below, we identify 35 unique states (38 less 3 totals) and 12 unique crimes (also excluding total crime).
# check for duplicates / typos states
sample_df %>%
arrange(state_ut) %>%
select(state_ut) %>%
unique()
## # A tibble: 38 x 1
## state_ut
## <chr>
## 1 ANDHRA PRADESH
## 2 A & N ISLANDS
## 3 ARUNACHAL PRADESH
## 4 ASSAM
## 5 BIHAR
## 6 CHANDIGARH
## 7 CHHATTISGARH
## 8 DAMAN & DIU
## 9 DELHI
## 10 D & N HAVELI
## # ... with 28 more rows
# check for duplicates / typos in crime type
sample_df %>%
arrange(crime_head) %>%
select(crime_head) %>%
unique()
## # A tibble: 13 x 1
## crime_head
## <chr>
## 1 ABETMENT OF SUICIDE
## 2 BUYING OF GIRLS FOR PROSTITUTION
## 3 EXPOSURE AND ABANDONMENT
## 4 FOETICIDE
## 5 INFANTICIDE
## 6 KIDNAPPING and ABDUCTION OF CHILDREN
## 7 MURDER OF CHILDREN
## 8 OTHER CRIMES AGAINST CHILDREN
## 9 PROCURATION OF MINOR GILRS
## 10 PROHIBITION OF CHILD MARRIAGE ACT
## 11 RAPE OF CHILDREN
## 12 SELLING OF GIRLS FOR PROSTITUTION
## 13 TOTAL CRIMES AGAINST CHILDREN
There are a number of observations labeled “total” in the states column that I don’t really need, so I’ll exclude them when creating a new dataframe (leaving the totals in the crime column). I’ll also fix a typo and convert states and crimes to title case.
#remove totals from state column -- NOTE that I leave the total in the crime column
df <- sample_df[!grepl("TOTAL", sample_df$state_ut),]
# fix typo
df$crime_head[df$crime_head=="PROCURATION OF MINOR GILRS"] <- "PROCURATION OF MINOR GIRLS"
#convert to title case
df$crime_head <- str_to_title(df$crime_head)
df$state_ut <- str_to_title(df$state_ut)
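Eyeballing the unique values works for 13 crime labels, but for longer lists an edit-distance check can help surface duplicates caused by typos. Below is a minimal sketch using base R's `adist()`. The `labels` vector is made up for illustration (the real table contains only the misspelled variant, so this check would not have flagged it), and the threshold of 2 edits is an arbitrary assumption.

```r
# Hypothetical labels: the first two differ only by a typo
labels <- c("PROCURATION OF MINOR GILRS",
            "PROCURATION OF MINOR GIRLS",
            "RAPE OF CHILDREN",
            "MURDER OF CHILDREN")

# Pairwise Levenshtein distances between all labels (utils package, base R)
d <- adist(labels)

# Keep each pair once (upper triangle) and flag pairs within 2 edits
close <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)
near_dupes <- data.frame(a = labels[close[, "row"]],
                         b = labels[close[, "col"]],
                         distance = d[close],
                         stringsAsFactors = FALSE)
near_dupes
```

Any pair that shows up here still needs a human decision about which spelling to keep.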
The data table appears to be set up to be readable in Excel (from my point of view). Gathering the years into one variable will make it easier to work with.
df <- df %>% gather("year", df, -state_ut, -crime_head, convert = T)
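`gather()` still works, but the tidyr documentation marks it as superseded by `pivot_longer()`. Here is a rough sketch of the same wide-to-long reshape on a made-up table, assuming tidyr 1.1.0 or later for `names_transform`. The value column is called `count` here purely for illustration; the rest of this post keeps the name `df`.

```r
library(tidyr)

# Made-up wide table with the same shape as the crime data (values are invented)
wide <- data.frame(
  state_ut   = c("State A", "State B"),
  crime_head = c("Crime X", "Crime X"),
  `2001` = c(1L, 0L),
  `2002` = c(3L, 2L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One row per state, crime and year; year names are converted to integers
long <- pivot_longer(
  wide,
  cols = -c(state_ut, crime_head),
  names_to = "year",
  values_to = "count",
  names_transform = list(year = as.integer)
)
long
```

Naming the value column something other than the data frame itself also makes later `filter()` and `summarize()` calls easier to read.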
I am still new to this and I suspect it makes more sense to begin with macro level analysis, but I started by focusing on the state of Tamil Nadu since that’s where I live. I was curious to see what crimes are most prevalent in this state.
df %>%
filter(state_ut == "Tamil Nadu" & year == 2012) %>%
arrange(desc(df))
## # A tibble: 13 x 4
## state_ut crime_head year df
## <chr> <chr> <int> <int>
## 1 Tamil Nadu Total Crimes Against Children 2012 1105
## 2 Tamil Nadu Kidnapping And Abduction Of Children 2012 560
## 3 Tamil Nadu Rape Of Children 2012 333
## 4 Tamil Nadu Murder Of Children 2012 118
## 5 Tamil Nadu Other Crimes Against Children 2012 49
## 6 Tamil Nadu Procuration Of Minor Girls 2012 41
## 7 Tamil Nadu Abetment Of Suicide 2012 2
## 8 Tamil Nadu Infanticide 2012 1
## 9 Tamil Nadu Exposure And Abandonment 2012 1
## 10 Tamil Nadu Foeticide 2012 0
## 11 Tamil Nadu Buying Of Girls For Prostitution 2012 0
## 12 Tamil Nadu Selling Of Girls For Prostitution 2012 0
## 13 Tamil Nadu Prohibition Of Child Marriage Act 2012 0
After identifying the most significant crimes in 2012, I chart how these crimes changed over time.
strip_theme <- theme(strip.background = element_rect(fill = "white", color = "#EDE5CF"),
strip.text = element_text(color = "#54203F", size = rel(1.1)),
panel.border = element_rect(color = "#EDE5CF"))
crimes <- c("Kidnapping And Abduction Of Children",
"Murder Of Children",
"Other Crimes Against Children",
"Procuration Of Minor Girls",
"Rape Of Children")
df %>%
filter((state_ut == "Tamil Nadu") & (crime_head %in% crimes )) %>%
ggplot(aes(year,df)) + geom_line(color = "#54203F") +
facet_wrap(~ crime_head, ncol = 2) +
labs(y = "Count", x = "") +
scale_x_continuous(labels = function(x) as.integer(x)) +
theme_light() + strip_theme +
theme(axis.text.x = element_text(hjust=1))

Kidnapping and rape appear to have the most alarming trajectories. I’m curious what average annual growth looks like.
df %>%
filter(state_ut == "Tamil Nadu", crime_head %in% crimes) %>%
group_by(crime_head) %>%
summarize(CAGR = scales::percent((df[year == 2012] / df[year == 2001]) ^ (1/11) - 1)) %>%
arrange(desc(CAGR))
## # A tibble: 5 x 2
## crime_head CAGR
## <chr> <chr>
## 1 Procuration Of Minor Girls NaN%
## 2 Kidnapping And Abduction Of Children 47.1%
## 3 Rape Of Children 28%
## 4 Murder Of Children 17.5%
## 5 Other Crimes Against Children 12.1%
Note that ‘Procuration Of Minor Girls’ is NaN% since it was 0 in 2001. Kidnappings have grown by almost 50% a year!
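For reference, the growth figure above is a compound annual growth rate: (end / start)^(1 / periods) - 1, where 2001 to 2012 spans 11 periods. The helper below is a hypothetical sketch, not part of the original analysis; it returns `NA` when the starting value is zero, which is the situation behind the NaN% row. The 100 and 200 in the first call are invented.

```r
# Compound annual growth rate between two observations.
# `periods` is the number of year-to-year steps (2001 -> 2012 is 11).
cagr <- function(start, end, periods) {
  # Growth from a zero (or negative) base is undefined, so return NA instead
  ifelse(start > 0, (end / start)^(1 / periods) - 1, NA_real_)
}

# Invented counts: 100 cases growing to 200 over 10 periods
cagr(100, 200, 10)

# A zero starting value (as with Procuration Of Minor Girls) yields NA
cagr(0, 41, 11)
```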
To add a little more context, I’ll take a look at kidnapping and abductions by state. Below, I select 12 states that have had the most kidnappings over the 12-year period.
top_k <- 12
high_ka_states <- df %>%
group_by(state_ut) %>%
filter(crime_head == "Kidnapping And Abduction Of Children") %>%
summarise(stotal = sum(df)) %>%
top_n(top_k)
kidnapping_plot <- df %>%
filter(crime_head == "Kidnapping And Abduction Of Children", state_ut %in% high_ka_states$state_ut) %>%
ggplot(aes(x=year,y=df, fill=state_ut, text = paste0("Year: ", year,"\nTotal: ", df))) +
geom_bar(stat='identity') +
labs(title = '', y = 'Number of Crimes', x='') +
scale_x_continuous(labels = function(x) as.integer(x)) +
facet_wrap(~state_ut) +
theme_light() + theme(strip.background = element_rect(color = "#93a1a1")) +
theme(legend.position='none',
axis.text.x = element_text(angle = 90, vjust = 0.5),
axis.ticks.x = element_blank()) +
strip_theme +
scale_fill_manual(values = colorRampPalette(brewer.pal(8, "Dark2"))(top_k))
ggplotly(kidnapping_plot, tooltip = c("text")) %>%
add_annotations(
yref="paper",
xref="paper",
y=1.15,
x=0,
text="Kidnapping And Abduction Of Children by State, 2001 - 2012",
align = "left",
valign = "bottom",
showarrow=F,
font=list(size=19)
) %>%
layout(margin = list(t=80), hovermode='x')
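A side note on the top-12 selection used for this chart: `top_n()` without a `wt` argument ranks by the last column, which happens to be `stotal` here. A more explicit alternative, assuming dplyr 1.0.0 or later, is `slice_max()`. The sketch below uses invented totals for five hypothetical states.

```r
library(dplyr)

# Invented totals for five hypothetical states
totals <- data.frame(
  state_ut = c("A", "B", "C", "D", "E"),
  stotal   = c(50, 400, 120, 10, 300),
  stringsAsFactors = FALSE
)

# Keep the 3 states with the largest totals, ranking explicitly by `stotal`
top3 <- totals %>%
  slice_max(stotal, n = 3)
top3$state_ut
```

Naming the ranking column means the selection keeps working if the columns are ever reordered.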
Uttar Pradesh stands out quite a bit, especially in 2012. Taking a closer look, we see it had more than four times as many kidnappings as any other state in 2012!
# keep 2012 kidnapping counts above 100
df %>%
filter(crime_head == "Kidnapping And Abduction Of Children", year == 2012, df > 100) %>%
mutate(state_ut = reorder(state_ut, df)) %>%
ggplot(aes(x=state_ut,y=df)) + geom_bar(stat='identity', fill="#813753") + coord_flip() +
geom_text(aes(y = df, x = state_ut, label = df), nudge_y = 350) +
labs(title = 'Number of Kidnappings And Abductions Of Children by State in 2012',
y = '', x='') +
theme(legend.position='none',
panel.background = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank())

The next question I have is what crimes are most significant in each state? A heatmap (or levelplot) might be the best way to visualize this. This also allows us to visualize the most prevalent crimes throughout India.
level_data <- df %>%
filter(year == '2012', crime_head != "Total Crimes Against Children")
colnames(level_data) <- c("State","Crime","Year","Count")
lplot <- level_data %>%
mutate(State = reorder(State, desc(State))) %>%
ggplot(aes(x=Crime,y=State, z=Count)) +
geom_tile(aes(fill = Count)) +
theme(axis.text.x = element_text(angle=90, hjust=1),
panel.background = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0, face = "bold")) +
scale_fill_gradient(name = "No. of\nCrimes",low="white", high="#54203F") +
labs(x = "", y = "")
ggplotly(lplot, tooltip = c("x","y","z")) %>%
add_annotations(
yref="paper",
xref="paper",
y=1.08,
x=0,
text="Number of Crimes by State - 2012",
align = "left",
valign = "bottom",
showarrow=F,
font=list(size=20)
) %>%
layout(margin = list(t=50))
As you can see, kidnappings and rape seem most significant across India. ‘Other’ crime is also significant – more research is necessary to learn what that comprises. It also appears that about half of the crimes are very low or 0 by count, which makes me suspect that data was unavailable or that such crimes don’t often get reported or prosecuted.
Shifting to a more macro view, we’ll take a look at total crimes by state over time. I select the top 12 states by cumulative total crime over the period. From the charts below, it appears that Madhya Pradesh and Maharashtra have had higher crime, but with low growth, over time. Crime in Uttar Pradesh, however, has been sporadic and grew significantly between 2010 and 2012.
Again, I’m interested in average annual growth, but here I take a look at total crimes by state. Tamil Nadu comes out on top. That is likely because we’re dealing with smaller numbers, but the trajectory is still quite steep. Uttar Pradesh had an average annual growth in crime of about 6% from 2001 to 2012, but crime fell from 2001 to 2002. Average growth from 2002 to 2012 was about 14.4%, which is more than twice as fast as indicated, but still places in the lower half of the chart below.
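The Uttar Pradesh example shows how sensitive this measure is to the choice of base year. As a rough illustration with invented numbers (not the actual Uttar Pradesh counts), the sketch below computes the growth rate to the final year from every possible base year of a series that dips in its second year.

```r
# Invented yearly counts with a dip in the second year
counts <- c(`2001` = 1000, `2002` = 500, `2003` = 700, `2004` = 1200)
years  <- as.integer(names(counts))
last   <- length(counts)

# Compound annual growth from each candidate base year to the final year
base_years <- years[-last]
growth <- (counts[last] / counts[-last])^(1 / (years[last] - base_years)) - 1
round(100 * growth, 1)
```

Starting from the dip year gives a much higher rate than starting one year earlier, even though the end point is identical.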
# Create vector to highlight first bar in chart
gr_ch_cols <- c("two", rep("one", 14))
growth.tbl <- df %>%
filter(crime_head == "Total Crimes Against Children", year %in% c(2001, 2012)) %>%
group_by(state_ut) %>%
# keep only states with a non-zero 2001 count, since growth from zero is undefined
filter(df[year == 2001] > 0) %>%
summarize(growth = 100 * ((df[year == 2012] / df[year == 2001]) ^ (1/11) - 1) ) %>%
arrange(desc(growth))
growth.tbl %>%
slice(1:15) %>%
mutate(state_ut = reorder(state_ut, growth)) %>%
ggplot(aes(x = state_ut, y = growth)) + geom_bar(stat='identity', aes(fill = gr_ch_cols)) +
scale_fill_manual(values = c("#813753","#54203F")) + coord_flip() +
geom_text(aes(y = growth, x = seq(15,1), label = paste0(round(growth),"%")), nudge_y = -1, color="white" ) +
labs(title = 'Geometric Growth Of Total Crimes Against Children (2001 - 2012)',
y = '', x='') +
theme(legend.position='none',
panel.background = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank())

Since I’m working with geographic data, I’d like to map it to visualize the relationship between crime and neighboring states. First, I have to prepare the dataframes for mapping and load the shape file for the states of India. I found a really helpful blogpost on this here.
# subset df for 2001
total_by_state_01 <- df %>%
filter(crime_head == "Total Crimes Against Children", year == 2001) %>%
mutate(state_ut = reorder(state_ut, df)) %>%
select(state_ut, df)
# subset df for 2012
total_by_state <- df %>%
filter(crime_head == "Total Crimes Against Children", year == 2012) %>%
mutate(state_ut = reorder(state_ut, df)) %>%
select(state_ut, df)
# subset df to display the median number of crimes per state over the entire period
med_by_state <- df %>%
filter(crime_head == "Total Crimes Against Children") %>%
group_by(state_ut) %>%
summarise(median = median(df)) %>%
arrange(desc(median))
# load shape file
states.shp <- rgdal::readOGR("India_Shape/IND_adm1.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/dave/R/blog/content/post/India_Shape/IND_adm1.shp", layer: "IND_adm1"
## with 37 features
## It has 12 fields
## Integer64 fields read as strings: ID_0 ID_1 CCN_1
states.shp.f <- fortify(states.shp, region = "ID_1")
# create a temporary data frame from names and ID's
tem_df <- data.frame(states.shp$ID_1, states.shp$NAME_1)
# join mapping dataframes with tem_df to facilitate merging later
total_by_state <- left_join(total_by_state, tem_df, by=c("state_ut" = "states.shp.NAME_1"))
total_by_state_01 <- left_join(total_by_state_01, tem_df, by=c("state_ut" = "states.shp.NAME_1"))
med_by_state <- left_join(med_by_state, tem_df, by=c("state_ut" = "states.shp.NAME_1"))
# renamed columns for readability
colnames(total_by_state) <- c("state","count","id")
colnames(med_by_state) <- c("state","median","id")
colnames(total_by_state_01) <- c("state","count","id")
# fix ID's that didn't quite match up for each dataframe
fix_states <- function(df){
df$id[df$state == "A & N Islands"] <- 1
df$id[df$state == "Jammu & Kashmir"] <- 14
df$id[df$state == "D & N Haveli"] <- 8
df$id[df$state == "Daman & Diu"] <- 9
df$id[df$state == "Delhi"] <- 25
return(df)
}
total_by_state <- fix_states(total_by_state)
total_by_state_01 <- fix_states(total_by_state_01)
med_by_state <- fix_states(med_by_state)
# I found Tamil Nadu was duplicated so the following code removes all duplicates
total_by_state <- total_by_state[!duplicated(total_by_state),]
total_by_state_01 <- total_by_state_01[!duplicated(total_by_state_01),]
med_by_state <- med_by_state[!duplicated(med_by_state),]
# rename columns in growth table (used for geometric mean previously)
colnames(growth.tbl) <- c("state","growth")
# merge growth figures with dataframes -- I decided not to use this in the end but leave it
# so as not to break anything I can't fix
total_by_state <- merge(total_by_state, growth.tbl, by="state", all.x=T)
total_by_state_01 <- merge(total_by_state_01, growth.tbl, by="state", all.x=T)
med_by_state <- merge(med_by_state, growth.tbl, by="state", all.x=T)
# create and sort tables for mapping
merge_tbl <- merge(states.shp.f, total_by_state, by="id", all.x=T)
merge_tbl_01 <- merge(states.shp.f, total_by_state_01, by="id", all.x=T)
merge_tbl_med <- merge(states.shp.f, med_by_state, by="id", all.x=T)
final.plt <- merge_tbl[order(merge_tbl$order),]
final.plt.01 <- merge_tbl_01[order(merge_tbl_01$order),]
final.plt.med <- merge_tbl_med[order(merge_tbl_med$order),]
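The manual ID fixes above were needed because a few state names in the crime data do not match the names in the shapefile. One way to find such mismatches up front, sketched here on invented stand-in tables rather than the real shapefile attributes, is `dplyr::anti_join()`, which returns the rows of the first table that have no match in the second.

```r
library(dplyr)

# Invented stand-ins for the crime table and the shapefile attribute table
crime_states <- data.frame(state_ut = c("Delhi", "Kerala", "A & N Islands"),
                           stringsAsFactors = FALSE)
shape_names  <- data.frame(id = 1:3,
                           NAME_1 = c("Andaman and Nicobar", "Kerala", "NCT of Delhi"),
                           stringsAsFactors = FALSE)

# Rows of crime_states with no matching NAME_1 in the shapefile table
unmatched <- anti_join(crime_states, shape_names, by = c("state_ut" = "NAME_1"))
unmatched$state_ut
```

Running a check like this before the `left_join()` calls lists exactly which states need a manual ID.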
First, a comparison between the total number of crimes in 2001 and 2012. Note the grey state just below the center, Telangana. This state was formed from the northwest part of Andhra Pradesh in 2014, after this dataset was created.
map_theme <- theme(panel.background = element_blank(),
plot.title = element_text(size=rel(1.5), hjust = 0.5),
axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank())
plot_2001 <- ggplot() +
geom_polygon(data = final.plt.01,
aes(x = long, y = lat, group = group, fill = count, text = paste0(state,": ",count)),
color = "white", size = 0.25) +
coord_map() +
scale_fill_gradient(name="No. of\nCrimes", limits=c(0,12000), low="#ede5cf", high="#54203F")+
labs(title="", x = "", y="") +
map_theme
plot_2012 <- ggplot() +
geom_polygon(data = final.plt,
aes(x = long, y = lat, group = group, fill = count, text = paste0(state,": ",count,"\nCAGR: ",scales::percent(growth/100))),
color = "white", size = 0.25) +
coord_map() +
scale_fill_gradient(name="No. of\nCrimes", limits=c(0,12000), low="#ede5cf", high="#54203F")+
labs(title="", x = "", y="") +
map_theme
subplot(ggplotly(plot_2001, tooltip = c("text")), ggplotly(plot_2012, tooltip = c("text"))) %>%
add_annotations(
yref="paper",
xref="paper",
y=1.15,
x=0,
text="Number of Crimes in India<br>2001 vs 2012",
align = "left",
valign = "bottom",
showarrow=F,
font=list(size=20)
) %>%
layout(margin = list(t=80))